DSCI 100 Group 17 Report: Classifying Celestial Bodies from Spectral Characteristics¶

Group members:

  • Aidan Wong
  • Ben Tyler
  • Tyson Quan

Introduction¶

Stars are large spheres of hot gas that emit heat and light into space. They are composed of mostly hydrogen, with some helium and other elements. The sun is an example of a star and is the closest star to Earth (NASA).

Galaxies are collections of planets, stars, gas, and dust that are all held together by gravity. Galaxies are very large and emit light from the stars and other material they contain. The Milky Way Galaxy is an example of a galaxy and is the one that Earth is a part of (NASA).

Quasars are the cores of active galaxies, powered by supermassive black holes. They emit immense amounts of heat and light due to the friction of material being drawn in. The closest quasar to Earth is called 3C 273 and can be seen with an 8-inch telescope (Cooper 2018).

The classification of celestial objects into stars, galaxies, and quasars has been pivotal to our understanding of Earth's position in space. It has led to key insights such as the discovery that the Andromeda galaxy is separate from our own, and this classification continues to be essential for astronomical research (Clarke 2020).

In this report, we will use data on celestial objects to answer the following question: "Based on its redshift and brightness in different wavelengths of light, what type of celestial object is this?"

Our data set is from Sloan Digital Sky Survey Data Release 16. It was collected by the Sloan Digital Sky Survey Telescope, a powerful telescope that measures spectral characteristics. It contains data on light emitted from galaxies, quasars, and stars, including redshift, which reflects how quickly an object is moving away from us (Fedesoriano, 2022), and brightness in five wavelengths of light. Below is a list of the variables collected and what they represent:

  • obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
  • alpha = Right Ascension angle
  • delta = Declination angle
  • u = Ultraviolet filter in the photometric system
  • g = Green filter in the photometric system
  • r = Red filter in the photometric system
  • i = Near Infrared filter in the photometric system
  • z = Infrared filter in the photometric system
  • run_ID = Run Number used to identify the specific scan
  • rerun_ID = Rerun Number to specify how the image was processed
  • cam_col = Camera column to identify the scanline within the run
  • field_ID = Field number to identify each field
  • spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
  • class = object class (galaxy, star, or quasar object)
  • redshift = redshift value based on the increase in wavelength
  • plate = plate ID, identifies each plate in SDSS
  • MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
  • fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation

We will focus on the u, g, r, i, z, and redshift variables to help predict the classification of the class variable.

Import Libraries¶

In [ ]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import make_column_selector
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Settings on Jupyter Notebook for printing and plotting graphs
set_config(transform_output="pandas")
alt.data_transformers.disable_max_rows()

# Seed to ensure reproducible report
np.random.seed(1234)

Methods and Results¶

This section consists of three main parts:

  1. Loading and Cleaning Data
  2. Exploratory Data Analysis
  3. Classification Analysis

Methods¶

In this section, we will explain the method we used to illustrate our findings.

First, we use six variables (columns): u, g, r, i, z, and redshift. The first five are brightness values in different bands of light: ultraviolet, green, red, near-infrared, and infrared (Fukugita et al., 1996). They are measured in magnitudes, which are unitless and reflect photon abundance (SDSS Voyages, 2024a). These magnitudes could help determine object class because quasars, galaxies, and stars can have distinctive colours (SDSS, n.d.a). We also include redshift, which indicates the lengthening of an object's light wavelengths due to the expansion of the universe (SDSS, 2024b). Galaxies and quasars often have higher redshift values than stars, so a high redshift can point toward those classes (Crockett, 2021).
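As a small illustration of how magnitude differences ("colours") can separate object classes, the sketch below computes u−g and g−r colour indices on a tiny made-up table. The magnitude values are invented for illustration only and are not drawn from the SDSS dataset.

```python
import pandas as pd

# Toy magnitudes (invented values, not from the SDSS dataset)
toy = pd.DataFrame({
    "u": [18.7, 19.1, 17.2],
    "g": [17.1, 18.9, 16.8],
    "r": [16.6, 18.8, 16.5],
})

# Colour indices are differences between magnitudes in adjacent bands;
# objects of different classes tend to occupy different colour regions.
toy["u-g"] = toy["u"] - toy["g"]
toy["g-r"] = toy["g"] - toy["r"]
print(toy[["u-g", "g-r"]])
```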

After gathering the necessary data, we proceeded with data preprocessing, which involved filtering and renaming the columns to ensure comprehensibility and ease of use. Once the dataset was cleaned and prepared, we conducted an exploratory analysis to gain a thorough understanding of the data. Initially, we examined the data types of each column and assessed the distribution of classes. Since we planned to perform K-Nearest Neighbor (KNN) classification, achieving a balanced distribution of classes was crucial for accurate results. To identify suitable variables for classification, we employed visualization techniques such as density plots, which helped us analyze the distinct characteristics exhibited by each variable.

After completing the exploratory analysis, we proceeded with the classification analysis using six selected variables. As the class distribution in the original dataset was imbalanced, we performed upsampling to create a balanced dataset. Subsequently, we followed the standard procedure for KNN classification. This involved standardizing the numerical variables and splitting the dataset into training and testing sets. To determine the optimal parameter k, we conducted cross-validation on the training dataset.
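The upsampling step described above can be sketched on a toy imbalanced frame; the class counts here are invented for illustration, and each minority class is resampled with replacement up to the majority-class size.

```python
import pandas as pd

# Toy imbalanced training frame (invented counts for illustration)
train = pd.DataFrame({
    "Class": ["GALAXY"] * 6 + ["STAR"] * 3 + ["QSO"] * 1,
    "Redshift": range(10),
})

# Upsample every class (with replacement) to the majority-class size
majority_n = train["Class"].value_counts().max()
balanced = pd.concat([
    grp.sample(n=majority_n, replace=True, random_state=1234)
    for _, grp in train.groupby("Class")
])
print(balanced["Class"].value_counts())
```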

To visualize the results of the cross-validation, we created a plot of k values against estimated accuracy, which aided in selecting the appropriate k value. Next, we evaluated the performance of our classification model using scoring functions and cross-tabulation analysis to gain a comprehensive understanding of the model's results. Additionally, we plotted a pairplot to explore the relationships between each parameter used in the classification model. Based on these findings, we repeated the same procedure for a new set of chosen variables.

Upon completing both models, we reached conclusions based on our findings, which are presented in the following section.

1. Loading and Cleaning Data¶

In [ ]:
# Load in the data file from the web
url="https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
star_data = pd.read_csv(url) #Citation: (Pandas, 2019)

star_data.head()
Out[ ]:
objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1237666301628060000 47.372545 0.820621 18.69254 17.13867 16.55555 16.34662 16.17639 4849 301 5 771 8168632633242440000 STAR 0.000115 7255 56597 832
1 1237673706652430000 116.303083 42.455980 18.47633 17.30546 17.24116 17.32780 17.37114 6573 301 6 220 9333948945297330000 STAR -0.000093 8290 57364 868
2 1237671126974140000 172.756623 -8.785698 16.47714 15.31072 15.55971 15.72207 15.82471 5973 301 1 13 3221211255238850000 STAR 0.000165 2861 54583 42
3 1237665441518260000 201.224207 28.771290 18.63561 16.88346 16.09825 15.70987 15.43491 4649 301 3 121 2254061292459420000 GALAXY 0.058155 2002 53471 35
4 1237665441522840000 212.817222 26.625225 18.88325 17.87948 17.47037 17.17441 17.05235 4649 301 3 191 2390305906828010000 GALAXY 0.072210 2123 53793 74
In [ ]:
# Cleaning data
# Filter relevant columns and rename columns for a more comprehensible understanding
star_filtered = (
    star_data.loc[:, ["u", "g", "r", "i", "z", "redshift", "class"]]
    .rename(columns={
        "u":"Ultraviolet", 
        "g":"Green", 
        "r":"Red", 
        "i":"Near Infrared", 
        "z":"Infrared",
        "redshift":"Redshift",
        "class":"Class"
    })
)
star_filtered.head()
Out[ ]:
Ultraviolet Green Red Near Infrared Infrared Redshift Class
0 18.69254 17.13867 16.55555 16.34662 16.17639 0.000115 STAR
1 18.47633 17.30546 17.24116 17.32780 17.37114 -0.000093 STAR
2 16.47714 15.31072 15.55971 15.72207 15.82471 0.000165 STAR
3 18.63561 16.88346 16.09825 15.70987 15.43491 0.058155 GALAXY
4 18.88325 17.87948 17.47037 17.17441 17.05235 0.072210 GALAXY

2. Exploratory Data Analysis¶

This section provides a summary of the data set relevant to the planned analysis.

In [ ]:
# General understanding of the dataset
star_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Ultraviolet    100000 non-null  float64
 1   Green          100000 non-null  float64
 2   Red            100000 non-null  float64
 3   Near Infrared  100000 non-null  float64
 4   Infrared       100000 non-null  float64
 5   Redshift       100000 non-null  float64
 6   Class          100000 non-null  object 
dtypes: float64(6), object(1)
memory usage: 5.3+ MB
In [ ]:
# Understand the proportion of classes in the dataset to determine whether we have to upsample the data or not
star_filtered["Class"].value_counts(normalize=True)
Out[ ]:
Class
GALAXY    0.51323
STAR      0.38096
QSO       0.10581
Name: proportion, dtype: float64

From the above information, we can see the data types of our dataset and the proportion of each class.

With this information, we conclude that we must upsample the dataset to achieve a fair classification of celestial bodies.

This section creates visualizations of the dataset relevant to the planned analysis.

In [ ]:
# Standardizing the data for plotting in the below sections
preprocessor_keep_all = make_column_transformer(
    (StandardScaler(), ['Ultraviolet', 'Green', 'Red', 'Near Infrared', 'Infrared', "Redshift"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)

# Use fit to compute all the necessary values to scale the data
preprocessor_keep_all.fit(star_filtered)

# transform function to apply the standardization
star_scaled = preprocessor_keep_all.transform(star_filtered)

star_scaled.head()
Out[ ]:
Ultraviolet Green Red Near Infrared Infrared Redshift Class
0 0.065633 -0.272293 -0.287759 -0.230598 -0.226791 -0.389669 STAR
1 -0.194147 -0.103121 0.317192 0.580613 0.705310 -0.390143 STAR
2 -2.596213 -2.126356 -1.166443 -0.746957 -0.501159 -0.389555 STAR
3 -0.002769 -0.531149 -0.691260 -0.757044 -0.805267 -0.257025 GALAXY
4 0.294775 0.479099 0.519436 0.453794 0.456601 -0.224904 GALAXY

In the code below, we plot density plots, as they are effective for comparing multiple distributions.

With these density distributions, we aim to identify variables that exhibit different distributions between the classes (e.g., star, galaxy, or quasar).

In [ ]:
# Plotting the distribution of different characteristics values based on their class.
star_exploration_plot = alt.Chart(
    star_scaled.melt(
        id_vars=["Class"],
        var_name="Characteristics",
        value_name="Values",
    )
).transform_density(
    "Values",
    groupby=["Class", "Characteristics"],
    as_=["Values", "Density"]
).mark_area(opacity=0.6).encode(
    x=alt.X("Values"),
    y=alt.Y("Density:Q", title="Density"),
    color="Class:N"
).properties(
    width=150,
    height=150
).facet(
    alt.Facet(
        "Characteristics",
        sort=star_scaled.columns[:-1].tolist()
    ),
    columns=6
).resolve_scale(
    # We set the scales to "independent" since each standardized characteristic
    # has its own range, so a shared axis range is not meaningful
    x="independent",
    y="independent"
)

star_exploration_plot 
Out[ ]:

As shown in the diagram, the classes tend to exhibit different distributions for each of the selected variables.

Hence, we can proceed with KNN classification based on the above analysis.

Exploration Analysis Conclusion¶

In conclusion, we must upsample our dataset, as the class proportions are unbalanced. We can conduct classification on the following variables: Ultraviolet, Green, Red, Near Infrared, Infrared, and Redshift, as each exhibits a different distribution for each class.

3. Classification Analysis¶

In [ ]:
# Splitting the data into training and testing data. We added stratify to ensure the classes of objects are distributed evenly in testing and training data.
star_train, star_test = train_test_split(
    star_filtered, train_size=0.75, stratify=star_filtered["Class"]
)

# We upsample the star and quasar classes to train our model on training data with equal proportions of classes, as quasars and stars are underrepresented.
QSO_train = star_train[star_train["Class"] == "QSO"]
STAR_train = star_train[star_train["Class"] == "STAR"]
GALAXY_train = star_train[star_train["Class"] == "GALAXY"]
QSO_upsample = QSO_train.sample(
    n=GALAXY_train.shape[0], replace=True
)
STAR_upsample = STAR_train.sample(
    n=GALAXY_train.shape[0], replace=True
)
upsampled_star_train = pd.concat((QSO_upsample, STAR_upsample, GALAXY_train))
In [ ]:
# We create our K-neighbours classifier object and preprocessor to standardize the training data. 
star_knn_1 = KNeighborsClassifier()

star_preprocessor = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"])
)

star_pipeline = make_pipeline(star_preprocessor, star_knn_1)

# We use GridSearchCV to tune our model to estimate the k value with the most accuracy.
parameter_grid ={
    "kneighborsclassifier__n_neighbors" : range(2,15,1),
}

star_tune = GridSearchCV(
    star_pipeline,
    parameter_grid,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

star_model = star_tune.fit(upsampled_star_train[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]], upsampled_star_train["Class"])

star_accuracy = pd.DataFrame(star_model.cv_results_)
In [ ]:
# We create a plot of k values against estimated accuracy to choose our k value.
accuracy_plot = alt.Chart(star_accuracy).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Accuracy estimate").scale(zero=False)
)
accuracy_plot
# Our accuracy plot shows that we should use k = 2 in our classification model, as it is estimated to provide the highest accuracy.
Out[ ]:
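Rather than reading k off the plot alone, the best value can also be pulled directly from a fitted `GridSearchCV` via `best_params_`. The sketch below is self-contained and uses invented toy data (not the SDSS set) purely to show the attribute access.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class data standing in for the real training set
rng = np.random.default_rng(1234)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

tune = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(2, 15)},
    cv=5,
)
tune.fit(X, y)

# best_params_ holds the k with the highest mean cross-validation accuracy,
# and best_score_ holds that accuracy estimate
print(tune.best_params_["n_neighbors"], round(tune.best_score_, 3))
```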
In [ ]:
# We use the tuned model to predict the class of each testing data observation in a new column in the testing data frame called "Prediction."
# The best neighbours value found two cells prior is stored in the star_tune object.
star_test["Prediction"] = star_tune.predict(
    star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]]
)
In [ ]:
# This cell outputs the accuracy of our model on the testing data, which is 96.008%.
star_tune.score(
    star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]],
    star_test["Class"]
)
Out[ ]:
0.96008
In [ ]:
# This cell outputs a confusion matrix for the test data, with each row representing the true class value, and each column representing the classes our model predicted.
pd.crosstab(star_test["Class"], star_test["Prediction"])
Out[ ]:
Prediction GALAXY QSO STAR
Class
GALAXY 12424 75 332
QSO 111 2529 5
STAR 465 10 9049
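From a confusion matrix like the one above, per-class recall (the fraction of each true class that was correctly predicted) can be read off by normalizing each row. A sketch with made-up labels, not the report's actual predictions:

```python
import pandas as pd

# Invented true and predicted labels for illustration
truth = pd.Series(["GALAXY", "GALAXY", "QSO", "STAR", "STAR", "STAR"])
pred = pd.Series(["GALAXY", "STAR", "QSO", "STAR", "STAR", "GALAXY"])

# Raw counts, as in the confusion matrix above
matrix = pd.crosstab(truth, pred)
print(matrix)

# normalize="index" divides each row by its total,
# so the diagonal entries become per-class recall
recall = pd.crosstab(truth, pred, normalize="index")
print(recall)
```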

Pair plot

In [ ]:
# This cell outputs a pair plot, which plots a subsample of the data for each pair of variables.
# This code was adapted from the Regression 2 Tutorial.
columns_to_plot = ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]

star_train_sample = star_train.sample(n=1000)

pairplot = alt.Chart(star_train_sample).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative").scale(zero=False),
    alt.Y(alt.repeat("column"), type="quantitative").scale(zero=False),
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=columns_to_plot
)
pairplot
Out[ ]:

Removing green, near infrared, and infrared

In [ ]:
# In the pair plot, we observed that the green, near infrared, and infrared predictor variables have a relatively strong positive correlation with the red variable.
# Therefore, we perform a classification without the green, near infrared, and infrared as predictors to compare the accuracy to our previous model.

star_knn_2 = KNeighborsClassifier()

star_preprocessor_2 = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Red", "Redshift"])
)

star_pipeline_2 = make_pipeline(star_preprocessor_2, star_knn_2)

parameter_grid_2 ={
    "kneighborsclassifier__n_neighbors" : range(2,15,1),
}

star_tune_2 = GridSearchCV(
    star_pipeline_2,
    parameter_grid_2,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

star_model_2 = star_tune_2.fit(upsampled_star_train[["Ultraviolet", "Red", "Redshift"]], upsampled_star_train["Class"])

star_accuracy_2 = pd.DataFrame(star_model_2.cv_results_)
In [ ]:
# We create a plot of k values against estimated accuracy to choose our k value.
accuracy_plot_2 = alt.Chart(star_accuracy_2).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Accuracy estimate").scale(zero=False)
)
accuracy_plot_2
# Our accuracy plot shows which k value is estimated to provide the highest accuracy;
# GridSearchCV stores this best value and uses it automatically when predicting.
Out[ ]:
In [ ]:
# We use the new tuned model to predict the class of each testing data observation in a new column in the testing data frame called "Prediction_2".
# The best neighbours value found two cells prior is stored in the star_tune_2 object.
star_test["Prediction_2"] = star_tune_2.predict(
    star_test[["Ultraviolet", "Red", "Redshift"]]
)
In [ ]:
# This cell outputs the accuracy of our model on the testing data, which is 95.308%.
star_tune_2.score(
    star_test[["Ultraviolet", "Red", "Redshift"]],
    star_test["Class"]
)
Out[ ]:
0.95308
In [ ]:
# This cell outputs a confusion matrix for the test data, with each row representing the true class value, and each column representing the classes our model predicted.
pd.crosstab(star_test["Class"], star_test["Prediction_2"])
Out[ ]:
Prediction_2 GALAXY QSO STAR
Class
GALAXY 12353 150 328
QSO 173 2467 5
STAR 516 1 9007
In [ ]:
# Pair plot showing redshift versus brightness magnitude for each light band, with point colours corresponding to the predicted class.
columns_to_plot = ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]

star_test_sample = star_test.sample(n=1000)

pairplot_2 = alt.Chart(star_test_sample).mark_point().encode(
    alt.X(alt.repeat("row"), type="quantitative").scale(zero=False),
    alt.Y(alt.repeat("column"), type="quantitative").scale(zero=False),
    color=alt.Color("Prediction").title("Prediction")
).properties(
    width=200,
    height=200
).repeat(
    column=columns_to_plot,
    row=["Redshift"]
)
pairplot_2
Out[ ]:

Discussion¶

Findings: We found that our first classification model, using all predictor variables, had an accuracy of 96.008%, meaning that it correctly classified about 96% of the test data objects. The accuracy of our second model was slightly lower at 95.308%, so it correctly classified the test data about 95% of the time. Both accuracy values are quite high: more than 19 times out of 20, our models were correct on the testing data.

Expectations:

  • Starting this project, we expected to be able to classify astronomical objects as stars, galaxies, or quasars based on their redshift and light emissions in five wavelengths. Our results show that we can achieve similar classification accuracy using redshift and only two wavelengths of light, red and ultraviolet, as opposed to all five.
  • In our exploratory data analysis, the distributions of stars and galaxies are fairly similar in the light magnitude variables. Therefore, we considered that our models might have trouble distinguishing between galaxies and stars. However, only 3% of galaxies in both models were incorrectly classified as stars, which did not match our expectation.
  • We had also expected objects with high brightness and large redshift to be classified as quasars. In our last figure, many of the points in the top middle and right areas of the graph are orange, indicating they were classified as quasars. However, our confusion matrices showed that 4% and 7% of quasars in our first and second models were classified as galaxies. Perhaps these galaxies were particularly bright with high redshift, or they could have been dim quasars with low redshift, suggesting an area our models could improve.
  • Our final plot shows that objects we classified as stars generally have the lowest redshift values but can extend to the highest brightness values of all objects. This fulfills our expectation of finding additional patterns between an object's class and its redshift and brightness values.

Impacts of findings:

  • Our models could potentially be used to categorize astronomical objects based on real brightness and redshift data. Our first model would likely yield slightly higher accuracy given its better accuracy score, while our second model uses only three predictor variables instead of six, so it could be less computationally expensive. This classification could allow astronomers to find new properties present in the different object classes and expand our astronomical knowledge of stars, quasars, and galaxies.
  • These models could also be used to observe how quasars, stars, and galaxies are distributed in the night sky. This could improve our knowledge on distributional and clustering patterns of these objects.
  • After classification with our models, researchers could also examine the objects’ properties at different redshifts. Given that more distant objects tend to have higher redshifts, perhaps more distant quasars, stars, and galaxies show unique characteristics compared to closer ones (Lloyd, n.d.).

Future Questions:

  • While our model might distinguish between Galaxy, Star and Quasar, we could pursue further subcategorization. We might ask: What type of galaxy is this? Perhaps this could reveal that a certain galaxy type has particularly high brightness, as seen in our final figure.
  • We could explore how different magnitudes of light bands correlate with physical attributes like size. After classifying objects with our models, one could investigate the size of these objects to see if relationships between light magnitude and size exists.
  • To improve our models’ accuracies, another question arises: What are additional helpful predictor variables we could introduce? For example, we might consider variables like the chemical composition of these objects (Center for Astrophysics, n.d.). If this variable was different between objects, it could provide distinct signals detectable by the classifier to raise model accuracy.

References¶

  • Camera. SDSS. (2022a). https://www.sdss4.org/instruments/camera/#Filters
  • Clarke, A. O., Scaife, A. M. M., Greenhalgh, R., & Griguta, V. (2020, July 13). Identifying galaxies, quasars, and stars with Machine Learning: A new catalogue of classifications for 111 million SDSS sources without spectra. Astronomy & Astrophysics. https://www.aanda.org/articles/aa/full_html/2020/07/aa36770-19/aa36770-19.html#R16
  • Cooper, K. (2018, February 24). Quasars: Everything you need to know about the brightest objects in the universe. Space.com. https://www.space.com/17262-quasar-definition.html
  • Data release 17. SDSS. (2022b). https://www.sdss4.org/dr17/
  • Fedesoriano. (2022, January 15). Stellar classification dataset - SDSS17. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data
  • Libretexts. (2023, January 5). 6: Redshifts. Physics LibreTexts. https://phys.libretexts.org/Courses/University_of_California_Davis/UCD%3A_Physics_156_-_A_Cosmology_Workbook/Workbook/06._Redshifts_(INCOMPLETE)
  • NASA. (n.d.-a). Galaxies. NASA. https://science.nasa.gov/universe/galaxies/
  • NASA. (n.d.-b). Stars. NASA. https://science.nasa.gov/universe/stars/
  • Fukugita, M., Ichikawa, Y., Gunn, J. E., Doi, M., Shimasaku, K., & Schneider, D. P. (1996). The Sloan Digital Sky Survey Photometric System. The Astronomical Journal, 111(4), 1748-1756. https://doi.org/10.1086/117915